Introduction

Hello, in this project we will build a few simple data visualizations using the BMI Analysis dataset.

The dataset consists of 741 records, each described by the following features:
* Age (in years): The age of each individual, in years.
* Height (in meters): The “Height” column gives each subject’s stature in meters, so heights can be compared directly.
* Weight (in kilograms): The “Weight” column gives each subject’s body mass in kilograms.
* BMI (Body Mass Index): Derived from the height and weight columns using the formula BMI = (weight in kg) / (height in m)^2. BMI is a continuous numerical indicator used to categorize individuals by their weight relative to their height.
* BmiClass: The “BmiClass” column categorizes individuals based on their BMI values. The six categories in this dataset are “Underweight,” “Normal Weight,” “Overweight,” “Obese Class 1,” “Obese Class 2,” and “Obese Class 3.” These classifications are commonly used in health and weight analysis.
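
As a quick check of the formula, the sketch below recomputes BMI for the first three records of the dataset (the same rows printed by head(data) in the next section). The vector names height_m, weight_kg, and bmi are illustrative and are not part of the original analysis.

```r
# Height (m) and weight (kg) of the first three records in the dataset
height_m  <- c(1.85, 1.71, 1.55)
weight_kg <- c(109.30, 79.02, 74.70)

# BMI = weight in kilograms divided by the square of height in meters
bmi <- weight_kg / height_m^2
round(bmi, 2)
```

The rounded values (31.94, 27.02, 31.09) agree with the Bmi column of those rows.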

Data Exploration

Data Input & Structure

First, we load the data.

data <- read.csv("C:/Users/HP/Downloads/dv-bmiclass/bmi.csv")
head(data)
##   Age Height Weight      Bmi      BmiClass
## 1  61   1.85 109.30 31.93572 Obese Class 1
## 2  60   1.71  79.02 27.02370    Overweight
## 3  60   1.55  74.70 31.09261 Obese Class 1
## 4  60   1.46  35.90 16.84181   Underweight
## 5  60   1.58  97.10 38.89601 Obese Class 2
## 6  59   1.71  79.32 27.12630    Overweight

Next, we inspect the data.

dim(data)
## [1] 741   5

The dataset consists of 741 rows (individuals) and 5 columns (variables). Next, let’s look at the data structure.

str(data)
## 'data.frame':    741 obs. of  5 variables:
##  $ Age     : int  61 60 60 60 60 59 59 59 59 59 ...
##  $ Height  : num  1.85 1.71 1.55 1.46 1.58 1.71 1.7 1.72 1.46 1.83 ...
##  $ Weight  : num  109.3 79 74.7 35.9 97.1 ...
##  $ Bmi     : num  31.9 27 31.1 16.8 38.9 ...
##  $ BmiClass: chr  "Obese Class 1" "Overweight" "Obese Class 1" "Underweight" ...

As we can see, each variable has an appropriate data type: Age is an integer, Height, Weight, and Bmi are numeric, and BmiClass is stored as character.
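
Although character storage works for the plots in this report, a categorical column like BmiClass is often converted to an ordered factor so that tables and axes follow a meaningful order. The sketch below is an optional step, not something the original analysis does. The sample vector and the level ordering (lowest to highest BMI) are assumptions made for illustration.

```r
# Illustrative values; in the analysis this would be data$BmiClass
bmi_class <- c("Obese Class 1", "Overweight", "Underweight", "Normal Weight")

# Assumed ordering of the six classes, from lowest to highest BMI
class_levels <- c("Underweight", "Normal Weight", "Overweight",
                  "Obese Class 1", "Obese Class 2", "Obese Class 3")

# Convert to an ordered factor; unseen levels are kept with a count of zero
bmi_class_f <- factor(bmi_class, levels = class_levels, ordered = TRUE)
levels(bmi_class_f)
table(bmi_class_f)
```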

Missing Data Check

Now we’ll make sure that the data doesn’t contain any missing values.

anyNA(data)
## [1] FALSE
colSums(is.na(data))
##      Age   Height   Weight      Bmi BmiClass 
##        0        0        0        0        0

Okay, we’re good to go. There are no missing values in the dataset.

Case Study

Before going deeper into the analysis, we’ll load the necessary packages.

library(ggplot2)
library(reshape2)
library(gcookbook)
library(dplyr)
library(magrittr)
library(plotly)
  1. Is there any correlation between Height and Weight?
cor(data$Height, data$Weight)
## [1] 0.6076716

From the result above, we can see that there is a moderately strong positive correlation (r ≈ 0.61) between Height and Weight. The following scatter plot shows the relationship between the two variables visually, with points colored by BmiClass.

ggplotly(ggplot(data, aes(Weight, Height))+
  geom_point(aes(colour = BmiClass))+
  labs(title = "Scatter Plot of Weight and Height with the BMI Class Classified")
)
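
To put the coefficient in context, the sketch below derives two further quantities from the reported r and the sample size: the share of variance explained by a linear fit (r squared) and the t statistic for testing whether the correlation differs from zero. This is an illustrative addition that uses only the values reported above; the variable names are not part of the original analysis.

```r
r <- 0.6076716   # Pearson correlation reported by cor() above
n <- 741         # number of records in the dataset

r_squared <- r^2                              # share of variance explained by a linear fit
t_stat    <- r * sqrt(n - 2) / sqrt(1 - r^2)  # t statistic for H0: correlation = 0
p_value   <- 2 * pt(-abs(t_stat), df = n - 2) # two-sided p-value

round(r_squared, 3)
t_stat
p_value
```

Roughly 37% of the variance in one variable is accounted for by a linear relationship with the other. With the full data frame loaded, cor.test(data$Height, data$Weight) should report the same test in one call.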
  2. What are the frequencies of each BMI Class?
bmiclass_df <- data %>%
  group_by(BmiClass) %>%
  summarize(freq = n())

bmiclass_df
## # A tibble: 6 × 2
##   BmiClass       freq
##   <chr>         <int>
## 1 Normal Weight   342
## 2 Obese Class 1    20
## 3 Obese Class 2    55
## 4 Obese Class 3    62
## 5 Overweight      166
## 6 Underweight      96
ggplot(bmiclass_df, aes(x = reorder(BmiClass, freq), y = freq)) +
  geom_col(fill = "pink") +
  # Bars are horizontal after coord_flip(), so nudge the labels with hjust
  geom_text(aes(label = freq), hjust = -0.2) +
  labs(title = "Distribution of Body Mass Index Categories", x = "BMI Category", y = "Frequency") +
  # Dashed red line marks the mean frequency across categories
  geom_hline(yintercept = mean(bmiclass_df$freq), color = "red", linetype = 5) +
  coord_flip()

As the bar plot above shows, only the Normal Weight (342) and Overweight (166) categories have frequencies above the mean frequency across all categories, which the dashed red line marks at 123.5 (741 / 6).
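
The comparison against the dashed line can be reproduced directly from the frequency table. The sketch below retypes the counts from the tibble above into a named vector (freq, mean_freq, and above_mean are illustrative names) and lists the categories that exceed the mean.

```r
# Frequencies copied from the bmiclass_df output above
freq <- c("Normal Weight" = 342, "Obese Class 1" = 20, "Obese Class 2" = 55,
          "Obese Class 3" = 62, "Overweight" = 166, "Underweight" = 96)

mean_freq  <- mean(freq)                    # mean frequency across the six categories
above_mean <- names(freq)[freq > mean_freq] # categories above that mean

mean_freq
above_mean
```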

  3. How is Age distributed within each BmiClass? Does the data look fine? Are there any outliers?
ggplotly(ggplot(data, aes(BmiClass, Age))+
  geom_boxplot(fill = "pink")+
  geom_hline(yintercept = mean(data$Age), color ="red", linetype = 5)+
  labs(title = "Boxplot of Age Distribution from Each BMI Class"))

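In the boxplot above, the dashed red line marks the overall mean age of 31.62 years. Each BmiClass’ age distribution looks fine, because none of the boxplots shows points beyond its whiskers, meaning there are no outliers. Overlaying the individual observations on the boxplots, as in the plot below, makes this easier to see.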

ggplot(data, aes(BmiClass, Age))+
  geom_jitter(aes(col = Age))+
  geom_boxplot(alpha = 0.5)+
  labs(title = "Jittered Age Observations over Boxplots by BMI Class")
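
For reference, a boxplot flags a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box. The sketch below illustrates that rule with base R's boxplot.stats() on a small hypothetical age vector (not taken from the dataset), where one value is deliberately extreme.

```r
# Hypothetical ages for illustration only; 95 is deliberately extreme
ages <- c(20, 22, 25, 27, 30, 95)

# coef = 1.5 is the usual whisker multiplier
bp <- boxplot.stats(ages, coef = 1.5)

bp$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out    # values beyond the whiskers, i.e. flagged outliers
```

With the real data loaded, something like tapply(data$Age, data$BmiClass, function(x) length(boxplot.stats(x)$out)) should count flagged points per class. Note that ggplot2 computes the box hinges slightly differently from boxplot.stats(), so borderline points may not match exactly.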

  4. In the boxplot from question 3, the age distributions of Obese Class 1, Obese Class 3, and Underweight look visually heterogeneous. Do the numbers confirm this?
aggregate(Age~BmiClass, data, mean)
##        BmiClass      Age
## 1 Normal Weight 27.73977
## 2 Obese Class 1 40.90000
## 3 Obese Class 2 33.21818
## 4 Obese Class 3 31.11290
## 5    Overweight 39.18072
## 6   Underweight 29.83333
aggregate(Age~BmiClass, data, sd)
##        BmiClass       Age
## 1 Normal Weight  8.641413
## 2 Obese Class 1 16.789408
## 3 Obese Class 2 13.571062
## 4 Obese Class 3  9.927780
## 5    Overweight 10.772743
## 6   Underweight 13.680310

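Well, according to the results above, each BmiClass’ standard deviation is well below its mean, so the ratio of the two (the coefficient of variation) stays under 1 for every class. By that rough criterion, the age distribution within each BmiClass is reasonably homogeneous, although the spread does differ between classes: Obese Class 1 has a standard deviation of 16.79 years, about twice that of Normal Weight (8.64 years).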
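
The ratio of standard deviation to mean can be computed explicitly. The sketch below retypes the two aggregate() outputs above into vectors (classes, mean_age, sd_age, and cv are illustrative names) and calculates the coefficient of variation for each class.

```r
# Values copied from the aggregate() outputs above
classes  <- c("Normal Weight", "Obese Class 1", "Obese Class 2",
              "Obese Class 3", "Overweight", "Underweight")
mean_age <- c(27.73977, 40.90000, 33.21818, 31.11290, 39.18072, 29.83333)
sd_age   <- c(8.641413, 16.789408, 13.571062, 9.927780, 10.772743, 13.680310)

# Coefficient of variation: standard deviation relative to the mean
cv <- setNames(sd_age / mean_age, classes)
round(sort(cv), 2)
```

The ratios range from about 0.27 (Overweight) to about 0.46 (Underweight). A threshold of 1 is only a loose rule of thumb, so this is a sanity check rather than a formal test of homogeneity.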

  5. Let’s look at the correlations among the numerical variables (Bmi, Height, Weight, and Age)
cor_matrix <-cor(data[,c(1:4)])
cor_melt <-melt(cor_matrix)
ggplot(cor_melt, aes(Var1, Var2, fill = value))+
  geom_tile()+
  geom_text(aes(label = round(value, 2)), color = "black", size = 3, vjust = 0.5)+
  scale_fill_gradient2(high = "magenta", midpoint = 0)+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_text(angle = 0, hjust = 1))

From the heatmap, the variable pairs rank from strongest to weakest correlation as follows:
1. Bmi x Weight
2. Weight x Height
3. Bmi x Height
4. Bmi x Age
5. Weight x Age
6. Height x Age
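
Such a ranking can also be produced programmatically instead of being read off the heatmap. The sketch below defines a hypothetical helper, rank_correlations(), that lists each variable pair once and sorts by absolute correlation. It is demonstrated on the six rows shown by head(data) only so that the block is self-contained; six rows are far too few to reproduce the ranking above, which comes from all 741 records.

```r
# Hypothetical helper: rank variable pairs by absolute Pearson correlation
rank_correlations <- function(df) {
  cm  <- cor(df)
  idx <- which(upper.tri(cm), arr.ind = TRUE)  # each pair once, no diagonal
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  pairs[order(abs(pairs$r), decreasing = TRUE), ]
}

# The six rows printed by head(data), used here purely as sample input
sample_df <- data.frame(
  Age    = c(61, 60, 60, 60, 60, 59),
  Height = c(1.85, 1.71, 1.55, 1.46, 1.58, 1.71),
  Weight = c(109.30, 79.02, 74.70, 35.90, 97.10, 79.32),
  Bmi    = c(31.93572, 27.02370, 31.09261, 16.84181, 38.89601, 27.12630)
)

ranked <- rank_correlations(sample_df)
ranked
```

With the full dataset loaded, rank_correlations(data[, 1:4]) would apply the same ordering to the correlation matrix used for the heatmap.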

Final Conclusion

From the data visualization we can conclude a few things:
1. The largest share of the sample (342 of 741 individuals, about 46%) is classified as Normal Weight
2. Within each BmiClass, the age distribution is reasonably homogeneous and contains no outliers
3. All of the numerical variables are correlated with one another to some degree, and Bmi is most strongly correlated with Weight
4. There isn’t much more to see in this dataset because it has only a few variables. However, multiple linear regression or logistic regression would be suitable methods for analyzing the data further.